Quality statistical charts: introduction to data visualisation

Tomasz Przechlewski
email:
affiliation: Powiślańska Szkoła Wyższa (Kwidzyn/Poland)

2024

How to lie with statistics

A now lesser-known book (once a bestseller) by Darrell Huff (142 pages, A5 format)

This is not Darrell Huff but look what Bill Gates recommended in 2015

https://www.gatesnotes.com/About-Bill-Gates/Summer-Books-2015?WT.mc_id=05_19_2015_SummerBooks_GeekWire

BTW: this photo (taken in 2015), coupled with the fact that Gates funded epidemiology research at Johns Hopkins University, has become “evidence” for various morons (of which there are plenty in the USA) that Gates was behind the COVID-19 pandemic.

A book written by Darrell Huff in 1954 presenting an introduction to statistics for the general reader. Not a statistician, Huff was a journalist […]

In the 1960s and 1970s, it became a standard textbook introduction to the subject of statistics for many college students […] one of the best-selling statistics books in history.

https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics

The book consists of 10 chapters and is written in a provocative (unscientific) way. Individual chapters are so well known that entering a chapter title into Google returns hundreds of thousands of references.

ch1: The Sample with the Built-in Bias (i.e., it is very difficult to draw an unbiased, perfectly random sample)

ch2: The Well-Chosen Average. An average can be manipulated in various ways: by choosing among different kinds of average, by defining the averaged units differently, or by measuring in different ways

ch3: The Little Figures That Are Not There (Figures = Details). In short: reporting results without context or important information

ch4: Much Ado about Practically Nothing (continues ch3: insignificant results, i.e., differences of no practical meaning)

ch5: The Gee-Whiz Graph (statistical graphs in Cartesian coordinates with the Y axis not starting from zero) https://en.wikipedia.org/wiki/Gee_Whiz https://en.wikipedia.org/wiki/Misleading_graph

ch6: The One-Dimensional Picture (comparing 1D quantities using 2D or pseudo-3D) https://thejeshgn.com/2017/11/17/how-to-lie-with-graphs/

ch7: The Semiattached Figure. Using one thing as a way to claim proof of something else, even though there’s no correlation between the two (not attached) https://www.secjuice.com/the-semi-attached-figure/

ch8: Post Hoc Rides Again (Correlation is not causation)

ch9: Misinforming people by the use of statistical material might be called statistical manipulation, in a word, Statisticulation. (summary of ch1–ch8)

ch10: How to Talk Back to a Statistic (How not to be deceived)

Who Says So? (interested parties can be unreliable; car seller reputation is poor);

How Does He Know? (measurement is often unreliable);

What’s Missing? (incomplete analysis signals bias);

Many figures lose meaning because a comparison is missing. In Poland there was a public discussion about falling fertility: women in Poland supposedly do not give birth to children; the average age of a mother at the birth of her first child is 27 years. [That is the norm across Europe.]

Did Somebody Change The Subject? (beware of the Semiattached Figure)

Does It Make Sense? (forget about statistics and think about common sense)

Despite its mathematical base, statistics is as much an art as it is a science (Huff p. 120)

Is it better now?

Unfortunately, quite the opposite…

Misleading statistical analyses are still doing quite well, if not better than in Huff’s times, which is probably due to the following factors:

Charts are not the only thing that can be misleading (intentionally or not), but this lecture is about charts, because charts are ubiquitous: statistical charts have become a favorite way for the media, including electronic and social media, to present results (we are inundated with charts that aim to prove something).

Why Are Statistical Charts Created?

Statistical charts can be created for the following three purposes:

  1. Decorative (To attract someone’s attention; a document without images is dull, colorful pictures are better than black-and-white ones; fancy drawings are better than simple ones. Form is king; content does not matter.)

  2. Explanatory (To better explain a certain phenomenon to someone. It is often said that a picture is worth a thousand words.)

  3. Exploratory (To identify data patterns during the exploratory/preliminary stage of data analysis.)

I will focus on the second point, i.e., on effective graphical methods for explaining relationships in data. One graphical method is more effective than another if the information it contains can be interpreted more efficiently or easily by the audience [Robbins 2005].

Types of Charts

Some charts are better than others:

Historical Corner

Bar charts, line charts, and pie charts were invented by William Playfair (an economist!) in the 18th century. Dot plots were created by William S. Cleveland in the 1980s. Box plots were introduced by John Tukey in the 1970s.

More of Playfair’s charts can be found via Google or in Symanzik’s paper (http://www.math.usu.edu/symanzik/papers/2009_cost/editorial.html)

Florence Nightingale also worked with statistics. The chart below is called the Nightingale Rose. It is a type of stacked bar chart, but in a polar coordinate system.

There are twelve sectors (polar bars) — one for each month.

The length of the radius, and thus the area of the sector, depends on the magnitude of the phenomenon it represents (the number of deaths due to: wounds, diseases, and other causes).

FN diagrams (Nightingale’s diagrams) didn’t catch on, but not every new idea is instantly brilliant…

Graphic Metaphors (aka Graphic Features)

Data visualization involves encoding relationships between numbers (quantitative information) using graphic metaphors (e.g., geometric shapes, angles, colors, positions, etc.).

Some metaphors are more effective than others in terms of clarity and accuracy.

According to William S. Cleveland (known for dot plots) and Robert McGill in their seminal paper “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods” (JASA, 1984), graphic metaphors can be ranked by effectiveness as follows: Position along a common scale ➔ Position along identical, nonaligned scales ➔ Length ➔ Slope or direction/Angle ➔ Area ➔ Volume (pseudo-3D graphics) ➔ Color (hue, saturation, or black density)

source: Alberto Cairo 2016

Key observations:

  1. Position is the most effective. Judging distances along a common scale is precise and intuitive for viewers.

  2. Angles are less effective. Humans struggle to compare angles accurately, especially when differences are small. Acute angles tend to be underestimated, while obtuse angles (greater than 90°) are overestimated.

  3. Area comparisons are imprecise. Differentiating between objects of similar areas is highly challenging.

  4. Color has low effectiveness. While visually appealing, colors (whether hue, saturation, or density) are poor for conveying precise quantitative differences.

These findings highlight why simple, position-based visuals like bar charts and scatter plots outperform complex visuals like pie charts or bubble charts.

NUTS and TERYT

The Nomenclature of Territorial Units for Statistics (NUTS) is a geocode standard for referencing the subdivisions of countries for statistical purposes. The standard is developed and regulated by the European Union, and thus only covers the member states of the EU in detail (cf NUTS)

The NUTS standard has been revised several times (on average every 4 years :-)), so there is even a page in the ec.europa.eu domain dedicated to the (short) history of NUTS (cf NUTS history)

NUTS1 (level) – macroregion, NUTS2 – province, NUTS3 – subregion (several counties in the case of Poland)

Poland is divided into 7 macroregions (NUTS1), 16 provinces (NUTS2), and 72 subregions (NUTS3).

The NUTS1 level is for statistical purposes only (but the regions are in fact distinct due to history, economics, natural conditions, cultural factors, etc.)

There is a relevant and interesting page by GUS (Główny Urząd Statystyczny, the Central Statistical Office of Poland), unfortunately in Polish (use Google Translate :-) if you are interested, or mail me) (cf Klasyfikacja NUTS w Polsce)

The above map shows 7 macroregions (NUTS1) and 16 provinces (NUTS2). BTW province in Polish is “prowincja” (both words come from Latin), but a Polish administrative province is actually called “województwo”, from “wodzić”, i.e., to command (armed troops in this context). This is an old term from the 14th century, when Poland was divided into provinces, each ruled by a “wojewoda”, i.e., the chief of that province. More can be found at Wikipedia (cf Administrative divisions of Poland)

NUTS3 consists of 380 counties grouped into 72 subregions.

A Polish county (called “powiat”) is a 2nd-level administrative unit.

In old Poland a powiat was called “starostwo” and the head of a “starostwo” was called “starosta”. “Stary” means old, so a “starosta” is an old (and thus wise) person. BTW the head of a powiat is still called “starosta”, as 600 years ago :-)

The 3rd level administrative unit is called gmina (municipality).

There are (approximately) 380 counties and 2750 municipalities in Poland.

As Poland’s population is 38.5 million and its area is 312.7 thousand sq km (about 120 persons per sq km), on average each powiat covers about 820 sq km and each gmina about 114 sq km, with approximately 100 thousand persons per “powiat” and 14 thousand per “gmina”.
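The averages quoted above are simple divisions, and it is worth redoing them once. A minimal sketch in Python, using only the rounded figures from the text (the variable names are made up for this illustration):

```python
# Rounded figures quoted in the text; not official statistics.
population = 38_500_000   # persons
area_km2 = 312_700        # 312.7 thousand sq km
n_powiats = 380
n_gminas = 2750

density = population / area_km2              # persons per sq km
area_per_powiat = area_km2 / n_powiats       # sq km
area_per_gmina = area_km2 / n_gminas         # sq km
persons_per_powiat = population / n_powiats
persons_per_gmina = population / n_gminas

print(f"density:            {density:.0f} persons per sq km")
print(f"area per powiat:    {area_per_powiat:.0f} sq km")
print(f"area per gmina:     {area_per_gmina:.0f} sq km")
print(f"persons per powiat: {persons_per_powiat:,.0f}")
print(f"persons per gmina:  {persons_per_gmina:,.0f}")
```

The results are only as precise as the rounded inputs, so quoting them as "about 120", "about 820" and so on is appropriate.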

TERYT is a Polish counterpart of NUTS (developed some 50 years ago). It is a complex system which includes the identification of administrative units. Every unit has an id number of up to 7 digits: wwppggt, where ww = “województwo” id, pp = “powiat” id, gg = “gmina” id and t encodes the type of municipality (rural, urban or mixed). Higher-level units have trailing zeros in the irrelevant part of the id, so 14 and 1400000 mean the same, as do 1205 and 1205000. Six digits are enough to identify a municipality (approx. 2750 units).
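The wwppggt layout can be illustrated with a few lines of Python. This is a sketch under the description above only: `parse_teryt` and `TerytId` are hypothetical names, not part of any official TERYT tool, and treating a zero type digit as "not given" is an assumption.

```python
from typing import NamedTuple, Optional


class TerytId(NamedTuple):
    wojewodztwo: str
    powiat: Optional[str]
    gmina: Optional[str]
    gmina_type: Optional[str]


def parse_teryt(code: str) -> TerytId:
    """Split a TERYT id of 2, 4, 6 or 7 digits into its wwppggt parts."""
    if not code.isdigit() or len(code) not in (2, 4, 6, 7):
        raise ValueError(f"unexpected TERYT id: {code!r}")
    padded = code.ljust(7, "0")  # 14 -> 1400000, 1205 -> 1205000
    ww, pp, gg, t = padded[:2], padded[2:4], padded[4:6], padded[6]
    return TerytId(
        wojewodztwo=ww,
        powiat=pp if pp != "00" else None,
        gmina=gg if gg != "00" else None,
        gmina_type=t if t != "0" else None,  # assumption: 0 means "not given"
    )


print(parse_teryt("14"))
print(parse_teryt("1205"))
print(parse_teryt("1205011"))  # illustrative 7-digit id
```

Padding with zeros first means that the short and long forms of the same id parse to the same result.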

So you are now experts on the administrative division of Poland, and we can go back to statistical charts…

One Variable

Categorical Variable: pie charts and bar charts

Example 1: Municipalities in Poland by type (source: Local Data Bank of the Central Statistical Office of Poland/BDL)

Bar chart

Pie chart

If there are few values, a pie chart is fine, but why visualize just three numbers?

Example 2: Land Use in Poland as a Percentage of Total Area (source: BDL)

In this example, the variable takes on more values, which immediately demonstrates the weaknesses of the pie chart.

https://bdl.stat.gov.pl/bdlarch/metadane/podgrupy/441?back=True agricultural land | forests | lands under water

Bar Chart

Pie chart:

Example 3: Nights spent at tourist accommodation establishments by non-residents, by EU country (2017). Source: Eurostat tour_occ_ninat

Pie charts:

Bar charts:

One Variable cont.

Quantitative Variable: histogram

A histogram is a graphical representation of the distribution of a dataset. It shows how frequently each value (or range of values) occurs within the dataset.

Example: The age of Nobel Prize laureates (up to 2018). Source: The Nobel Prize API Developer Hub

Histograms with a bin (interval) width of 10, 5, 2 and 1 years:
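How the bin width changes the picture can be tried out without any plotting library. A minimal sketch: the ages below are made up for illustration (they are not the Nobel data), and `histogram` is a hypothetical helper.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each bin covers [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical ages, for illustration only.
ages = [25, 31, 35, 42, 44, 47, 51, 52, 55, 58,
        59, 61, 63, 64, 66, 68, 71, 74, 79, 88]

for width in (10, 5, 2, 1):
    print(f"bin width = {width}")
    for start, count in histogram(ages, width).items():
        print(f"  {start:>3}-{start + width - 1:<3} {'#' * count}")
```

Wide bins hide detail, while very narrow bins mostly show noise (with a width of 1 nearly every bar has height 1), which is why the choice of bin width matters.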

Comparison of distributions

Qualitative variable

stacked barchart vs grouped barchart

Grouped bar chart

Stacked bar chart

or

Comparison of three provinces reveals the limitations of pie charts (with more numbers, the pie chart becomes unreadable/ineffective).

Land use in selected provinces

Still insist on using pie charts? 😊

Another example

CBOS (leading Polish government-funded research institute focused on public opinion polling) conducts the survey “Current Problems and Events” at least 12 times a year, on a representative sample of approximately 1,000 adult residents of Poland. (cf https://www.cbos.pl/PL/trendy/trendy.php?)

This research measures trust in politicians. Trust is assessed through a single question, which reads as follows:

Public figures—through their actions, what they say, and their goals—evoke varying degrees of trust. We will now present you with a list of individuals active in the political life of our country. For each of them, please indicate the extent to which they inspire your trust. When responding, please use a scale where -5 means that you have deep distrust for the person, 0 means that you are indifferent toward them, and +5 means that you have full trust in them. Of course, you may also use other points on the scale. If you are not familiar with someone, please let us know.

The percentages of respondents expressing trust correspond to ratings from +1 to +5, distrust corresponds to ratings from -1 to -5, and indifference is represented by a rating of 0.

In its summaries, CBOS excludes responses of “difficult to say” (indifference) and refusals to answer.
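The recoding of the -5..+5 scale into three categories can be sketched as follows. This is not CBOS code: `summarize_trust` is a hypothetical helper, `None` stands for the answers left out of the summaries, and dropping those answers from the percentage base is an assumption of this sketch.

```python
from collections import Counter


def summarize_trust(ratings):
    """Percentages of trust / indifference / distrust.

    ratings: integers from -5 to +5; None marks an answer that is
    left out (assumption: it is also dropped from the percentage base).
    """
    valid = [r for r in ratings if r is not None]
    if not valid:
        raise ValueError("no valid ratings")
    counts = Counter(
        "trust" if r > 0 else "distrust" if r < 0 else "indifference"
        for r in valid
    )
    return {
        key: 100 * counts[key] / len(valid)
        for key in ("trust", "indifference", "distrust")
    }


# Made-up answers for one politician.
print(summarize_trust([5, 3, 0, -2, None, 1]))
```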

Stacked barchart

Panel of barcharts

Panel of piecharts (to convince those who remain unconvinced)

Quantitative Variable: box and whisker plot

Box plots are much better than histograms for comparing distributions.

Construction of a (typical) box plot:

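Notice the trick: outliers are not defined as (for example) the upper/lower 1% of all values (because then every distribution would have outliers); rather, they are values smaller than \(Q_1 - 1.5 \times IQR\) or larger than \(Q_3 + 1.5 \times IQR\), where \(Q_1\) and \(Q_3\) are the first and third quartiles and \(IQR = Q_3 - Q_1\).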

All values in distributions with moderate variability fit within such a range.
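The outlier rule can be written down in a few lines. A minimal sketch using Python's standard library: the quartile convention (`method="inclusive"`) is an assumption, plotting packages use slightly different ones, and the sample values are made up.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, k=1.5):
    """Values lying outside the fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data
print(tukey_fences(sample))
print(outliers(sample))
```

Because the limits depend on the quartiles of the data itself, a distribution without extreme values simply has no outliers.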

Example: the age of Nobel Prize laureates.

Quantitative Variable: strip charts

A strip chart represents the distribution of values along an axis. Such a plot can be used as an alternative to a box plot (because it retains more information about the data).

Example: the age of Nobel Prize laureates.

A serious problem with a strip plot is overlapping points.

There is no perfect solution to this problem, but several techniques can help: use smaller dots, use semi-transparent dots (right panel), or apply jitter.

Jitter is a small random noise added to the data (below; larger jitter in the right panel).
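Jitter is easy to produce by hand. A minimal sketch: `jitter` is a hypothetical helper, uniform noise is one possible choice among several, and the ages are made up.

```python
import random


def jitter(values, amount, seed=None):
    """Add uniform random noise from [-amount, +amount] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


ages = [60, 60, 60, 61, 61]  # made-up values with heavy overlap

# In a strip chart the noise is usually applied to the other coordinate
# (all points start at the same position there), so the ages are not changed.
positions = jitter([0.0] * len(ages), amount=0.2, seed=1)
for age, pos in zip(ages, positions):
    print(f"age={age}  offset={pos:+.3f}")
```

A larger `amount` separates the points better but makes the strip wider, which is the trade-off between the two panels.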

Combining box plots with strip charts

Two variables

Purpose: to show relationships between two (or more, as in a bubble plot) numeric variables…

Example: GDP versus CO2 emissions

https://data.worldbank.org/indicator/EN.GHG.CO2.MT.CE.AR5?end=2022&start=1970&view=chart Carbon dioxide (CO2) emissions (total) excluding LULUCF (Mt CO2e)

Unit of measure: Mt CO2e, i.e., million metric tonnes of CO2 equivalent

Balloon plot

GDP vs CO2 emissions cont.

Time series

Line plot or bar plot

Purpose:

Example: Per capita emissions of CO2 equivalents (metric tonnes)

Line plot

Barchart

Stacked barchart

Panel

BTW: there is a problem in EN.GHG.CO2.MT.CE.AR5 description at

https://data.worldbank.org/indicator/EN.GHG.CO2.MT.CE.AR5?end=2022&start=1970&view=chart

https://data.worldbank.org/indicator/EN.GHG.CO2.MT.CE.AR5?end=2022&start=1970&view=map

Climate disaster is utterly important, yeah?

General Design Principles

  1. Understandable Content. Ensure the reader clearly understands what the chart represents. Include axis labels, scale descriptions, and necessary explanations.

  2. Clear Form. The reader must easily see the presented information. Avoid tangled lines, overlapping elements, or clutter.

  3. Emphasize Data. Highlight the data, not unnecessary elements like grid lines, redundant legends, or meaningless arrows. Keep the design simple.

  4. Axis and Labeling Guidelines. Place tick marks and axis labels externally to avoid clutter. X-axis values should increase from left to right, and Y-axis values from bottom to top—never the reverse. Use a reasonable number of axis labels to avoid overcrowding.

  5. Accessibility and Scalability. Design for readability in black-and-white mode or when scaled down (e.g., photocopied or viewed on a smartphone).

  6. Avoid Overcomplication. Use only as many visual metaphors as there are data dimensions. Bar chart rectangles should be uniform in color. 3D charts are a disaster. They add complexity without improving clarity.

  7. Optimize Baselines and Proportions.

    Use a shared baseline when possible, especially for comparison purposes.

    For line charts, aim for a 45° slope for optimal proportions.

    Use logarithmic scales for large data ranges but avoid truncated scales unless necessary.

    Start axes at 0 unless a specific exception justifies otherwise.

    Avoid dual axes, as they complicate interpretation.

  8. Prefer Labels Over Legends.

    Labels placed directly on the chart are preferable to legends.

    Use a legend only when space constraints prevent the use of labels.

  9. Avoid Multi-Line Charts.

    Multi-line charts are generally problematic due to multiple scales, visual clutter, and difficulty in assessing differences between lines.

By adhering to these principles, your charts will be more effective, clear, and easier to interpret.

Edward Tufte’s recommendations

Edward Tufte, a renowned expert in data visualization, proposed two key principles to enhance the clarity and integrity of visualizations:

Data-to-Ink Ratio (DI)

Definition: The proportion of “ink” (visual elements) dedicated to representing the data versus all the ink used in the chart.

Maximize the data-to-ink ratio, meaning: Minimize decorative or non-essential elements. Focus on presenting as much data as possible in a clear and concise manner.

https://www.youtube.com/watch?v=JIMUzJzqaA8

Practical Advice:

Remove unnecessary grid lines, shading, or other embellishments. Ensure every visual element serves a purpose in communicating data.

Lie Factor (LF)

Definition: The ratio of the graphical representation of a value to the actual data value.

Ideal Value: LF should equal 100% for accurate representation. LF > 105% or < 95% signifies significant distortion of the data.

Example: Average Female Heights

The chart shows average female heights across various countries.

Latvian women appear to be 4.35 times taller than Indian women based on the chart’s visual effect (\(135/31 \approx 4.35\)).

However, the actual height difference is far smaller, roughly \(169.8/152.6 \approx 1.11\) times (heights in cm).

This discrepancy violates Tufte’s Lie Factor rule, as the LF is significantly higher than 105%, misleading the audience.
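The lie factor for this example can be computed directly from the numbers above. A minimal sketch: `lie_factor` is a hypothetical helper that divides the ratio shown by the graphic by the ratio present in the data.

```python
def lie_factor(shown_ratio, data_ratio):
    """Ratio seen in the graphic divided by the ratio in the data."""
    return shown_ratio / data_ratio


shown = 135 / 31        # Latvia vs India as drawn in the chart
actual = 169.8 / 152.6  # Latvia vs India in the data (cm)

lf = lie_factor(shown, actual)
print(f"shown ratio:  {shown:.2f}")
print(f"actual ratio: {actual:.2f}")
print(f"lie factor:   {lf:.0%}")
print("distorted" if not 0.95 <= lf <= 1.05 else "acceptable")
```

The result is close to 400%, far outside the 95%-105% band.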

Summary of Tufte’s Principles

More subtle example

This giant guy (GG) in the middle is our ex-president. The guy next to him on the left is our current president, Duda. Next to Duda is the ex-rock star Kukiz, the dark horse of the elections. This is the (slightly modified) cover of an influential Polish weekly magazine from May 2015, shortly before the elections.

The figures are claimed to be in sync with recent survey results (a sort of bar chart). Can you work out from the chart the proportions between the candidates’ scores? By how much does the giant guy outperform the runner-up? Which candidate does this influential magazine support (easy :-))?

The lie-factor details:

The line from the shoes to the top of the head measures (at a certain size, of course) 204 mm for GG, 134 mm for Duda and 42.5 mm for the ex-rock star. So \(204/134 \approx 1.5\) and \(204/42.5 = 4.8\). As \(44/29 \approx 1.5\) and \(44/9 \approx 4.9\), formally the lie factor is almost perfect. But should one compare lengths or areas?

If one compares areas rather than heights, one gets significantly different (and correct) results, namely \((204 \times 58)/(134 \times 21) \approx 4.20\) and \((204 \times 58)/(42.5 \times 15) \approx 18.56\). The lie factor is \(4.2/1.5 = 280\)% and \(18.56/4.8 \approx 387\)%, respectively. A huge distortion.
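The height-versus-area comparison can be checked with a short script. A sketch only: the heights and widths (in mm) are the measurements quoted above, and the dictionary layout and names are made up for this illustration.

```python
# (height, width) in mm, as measured on the cover.
figures = {"GG": (204, 58), "Duda": (134, 21), "Kukiz": (42.5, 15)}

gg_height, gg_width = figures["GG"]

ratios = {}
for name in ("Duda", "Kukiz"):
    height, width = figures[name]
    height_ratio = gg_height / height
    area_ratio = (gg_height * gg_width) / (height * width)
    ratios[name] = (height_ratio, area_ratio)
    print(f"GG vs {name}: heights {height_ratio:.2f}, "
          f"areas {area_ratio:.2f}, "
          f"areas/heights {area_ratio / height_ratio:.0%}")
```

With unrounded ratios the first lie factor comes out at about 276% rather than 280%. Note also that areas/heights reduces to the ratio of the widths (58/21 and 58/15), so the whole distortion comes from drawing GG wider as well as taller.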

Moreover, two more tricks were applied to boost GG. Can you see them?

BTW: the text in the pink frame claims: “figure ratios are consistent with the April-May survey outcome.” (But what exactly does “figure ratios” mean?)

Banking to 45

The question is which aspect ratio is the best.

We recognize change most easily when the absolute slopes are close to a 45-degree angle on the graph. It is much harder to see change if the curves are nearly horizontal or vertical. The idea behind banking (Cleveland, 1988) is therefore to adjust the aspect ratio of the entire plot so that most slopes are at approximately 45 degrees.

Setting the aspect ratio so that the average of the values of the orientations is 45 degrees is called “banking the average orientation to 45 degrees”.

Setting the aspect ratio so that the weighted mean of the orientations of the line segments (weighted by the segments’ lengths) is approximately 45 degrees is called the average weighted orientation method (to 45 degrees).
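The unweighted variant (banking the average orientation to 45 degrees) can be sketched numerically. This is an illustration under stated assumptions, not Cleveland's original procedure: `bank_to_45` is a hypothetical helper, both axes are first rescaled to the range 0..1, flat and vertical segments are skipped, and the aspect ratio (height/width) is found by bisection.

```python
import math


def bank_to_45(xs, ys):
    """Height/width ratio at which the mean absolute orientation
    of the line segments equals 45 degrees."""
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    if x_range == 0 or y_range == 0:
        raise ValueError("data must vary along both axes")

    # Absolute segment slopes after rescaling both axes to 0..1.
    slopes = []
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        dx = (x1 - x0) / x_range
        dy = (y1 - y0) / y_range
        if dx != 0 and dy != 0:  # assumption: skip flat/vertical segments
            slopes.append(abs(dy / dx))
    if not slopes:
        raise ValueError("no sloped segments to bank")

    def mean_orientation(aspect):
        return sum(math.atan(aspect * s) for s in slopes) / len(slopes)

    # The mean orientation grows with the aspect ratio: bisect in log space.
    low, high = 1e-9, 1e9
    for _ in range(200):
        mid = math.sqrt(low * high)
        if mean_orientation(mid) < math.pi / 4:
            low = mid
        else:
            high = mid
    return math.sqrt(low * high)


# A zig-zag over three unit steps: the plot should be three times wider than tall.
print(round(bank_to_45([0, 1, 2, 3], [0, 1, 0, 1]), 3))
```

The weighted variant would replace the plain mean with a mean weighted by segment length; the search itself stays the same.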

Exercise: assess which slope is the steepest and which is the gentlest.

BTW: every chart presents the same data on atmospheric CO2 concentration (average for May of each year) as provided by the US Government’s Earth System Research Laboratory, Global Monitoring Division (cf CO2 PPM - Trends in Atmospheric Carbon Dioxide).

Scale

How many Nobel Prizes have Poles received?

I asked AI

Why does even AI have problems?

https://www.youtube.com/watch?v=arKhvVWGXFo

A logarithmic scale should be used when the dataset being visualized has a large range.

As an example, let’s once again consider Nobel Prize laureates, this time by country of birth (bornCountryCode)…

Scatter plots using different scales on the Y-axis (arithmetic, log2, and log10).

Exact data:

country bornCountryCode n
United States US 269
United Kingdom GB 100
Germany DE 82
France FR 55
Sweden SE 29
Japan JP 26
Russia RU 26
Poland PL 25
Canada CA 19
Italy IT 19
Netherlands NL 18
Austria AT 17
Switzerland CH 17
China CN 12
Denmark DK 12
Norway NO 12
Australia AU 10
Belgium BE 9
Hungary HU 9
South Africa ZA 9
India IN 8
Spain ES 7
Czechia CZ 6
Egypt EG 6
Israel IL 6
Finland FI 5
Ireland IE 5
Ukraine UA 5
Argentina AR 4
Belarus BY 4
Romania RO 4
Lithuania LT 3
Mexico MX 3
New Zealand NZ 3
Pakistan PK 3
Turkey TR 3
Bosnia & Herzegovina BA 2
Chile CL 2
Colombia CO 2
Algeria DZ 2
Guatemala GT 2
Iran IR 2
South Korea KR 2
St. Lucia LC 2
Liberia LR 2
Luxembourg LU 2
Portugal PT 2
Timor-Leste TL 2
Azerbaijan AZ 1
Bangladesh BD 1
Bulgaria BG 1
Brazil BR 1
Costa Rica CR 1
Cyprus CY 1
Ghana GH 1
Guadeloupe GP 1
Greece GR 1
Croatia HR 1
Indonesia ID 1
Iceland IS 1
Kenya KE 1
Latvia LV 1
Morocco MA 1
Madagascar MG 1
North Macedonia MK 1
Myanmar (Burma) MM 1
Nigeria NG 1
Peru PE 1
Slovenia SI 1
Slovakia SK 1
Taiwan TW 1
Venezuela VE 1
Vietnam VN 1
Yemen YE 1
Zimbabwe ZW 1

PL – 25 Nobel Prizes 😊 (mainly Germans and (Russian) Jews born in the German and Russian Empires, respectively)
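What the logarithmic scale does to the table above can be seen by computing where a few countries land on the Y-axis. A minimal sketch: the counts are copied from the table, while `linear_position` and `log_position` are hypothetical helpers returning the position as a fraction of the axis length (0 = bottom, 1 = the largest value).

```python
import math

# A few rows from the table above.
counts = {"US": 269, "GB": 100, "PL": 25, "IN": 8, "LV": 1}
largest = max(counts.values())


def linear_position(n):
    """Position on an arithmetic axis running from 0 to the largest value."""
    return n / largest


def log_position(n):
    """Position on a log axis running from 1 to the largest value."""
    return math.log(n) / math.log(largest)


for code, n in counts.items():
    print(f"{code}  n={n:>3}  linear={linear_position(n):.3f}  "
          f"log={log_position(n):.3f}  "
          f"log2={math.log2(n):.2f}  log10={math.log10(n):.2f}")
```

On the arithmetic scale everything below a few dozen prizes is squeezed into the bottom tenth of the axis, while on the log scale Poland sits above the middle. The base of the logarithm (2 or 10) changes only the tick labels, not the relative positions.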

Examples of Poor Charts

Malbork Castle, 40 km from PSW https://www.youtube.com/watch?v=PGkpg9wd3ak

A peer-reviewed paper on tourist traffic in the museum of Malbork Castle: Parzych Krzysztof, The determinants of the tourist traffic in the castle’s museum of Malbork, Journal of Education, Health and Sport.

This paper demonstrates all the textbook mistakes discussed earlier:

More readable charts (if one insists on using pie charts):

Bar charts are better, as usual:

It can get even worse (yes we can:-))

What is this?

Pie charts are known for their mediocrity

Bar charts can also be spectacularly ruined

The distribution of seats in the Sejm after the 2015 elections

Why did Polish teachers protest?

A frequently shown chart aimed at convincing public opinion that teachers are much worse off than before: (Average salary as a % of the overall average?)

If you start from zero, it does not look so dramatic…
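The effect of a truncated axis is easy to quantify. A minimal sketch with made-up numbers (these are not the teachers' salary data); `apparent_ratio` is a hypothetical helper giving the ratio of bar lengths for a given starting point of the axis.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths for values a and b
    when the axis starts at `baseline` instead of zero."""
    return (a - baseline) / (b - baseline)


before, after = 100.0, 90.0  # made-up values

true_ratio = apparent_ratio(before, after)            # axis starts at 0
shown_ratio = apparent_ratio(before, after, baseline=80)

print(f"axis from 0:  {true_ratio:.2f}")
print(f"axis from 80: {shown_ratio:.2f}")
print(f"lie factor:   {shown_ratio / true_ratio:.0%}")
```

A drop of 10% looks like a drop by half once the axis starts at 80, which gives a lie factor of 180%.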

Examples of Poor Charts cont.

Ruble crash according to NYT

The collapse of the ruble exchange rate in February/March 2022. What is very wrong with the chart?

Thank you

Lecture notes/handouts and data sets are available here: https://github.com/hrpunio/Erasmus_2024_Sousse

Baltic sea in the morning